The UPC TweetMT participation: Translating Formal Tweets Using Context Information
Abstract
In this paper, we describe the UPC systems that participated in the TweetMT shared task. We developed two main systems for the Spanish–Catalan language pair: a state-of-the-art phrase-based statistical machine translation system and a context-aware system. For the context-aware system, we define the “context” of a tweet as the tweets produced by the same user on the same day, and we study the impact of this information on the final translations when using a document-level decoder. A variant of this approach also considers semantic information from bilingual embeddings.
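The context definition in the abstract (tweets by one user on one day) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `group_tweets_by_user_and_day` and the dictionary keys `user`, `created_at`, and `text` are hypothetical, chosen only to show how tweets might be grouped into per-user, per-day pseudo-documents before being passed to a document-level decoder.

```python
from collections import defaultdict
from datetime import datetime


def group_tweets_by_user_and_day(tweets):
    """Group tweet texts into pseudo-documents keyed by (user, calendar date).

    Each tweet is assumed (hypothetically) to be a dict with the keys
    'user', 'created_at' (ISO 8601 timestamp string), and 'text'.
    """
    documents = defaultdict(list)
    for tweet in tweets:
        day = datetime.fromisoformat(tweet["created_at"]).date()
        documents[(tweet["user"], day)].append(tweet["text"])
    return dict(documents)


# Illustrative data only; not taken from the TweetMT corpus.
sample = [
    {"user": "user_a", "created_at": "2015-03-01T09:00:00", "text": "first"},
    {"user": "user_a", "created_at": "2015-03-01T18:30:00", "text": "second"},
    {"user": "user_a", "created_at": "2015-03-02T08:00:00", "text": "third"},
    {"user": "user_b", "created_at": "2015-03-01T10:00:00", "text": "fourth"},
]

contexts = group_tweets_by_user_and_day(sample)
```

Under these assumptions, each value in `contexts` is the list of tweets that would serve as context for any one of its members.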
Similar resources
EHU at TweetMT: Adapting MT Engines for Formal Tweets
This paper describes the participation of the IXA group from the UPV/EHU (University of the Basque Country) in the TweetMT shared task at the SEPLN-2015 conference. We have adapted existing MT engines for the es-eu and eu-es pairs, obtaining good results (better than other experiments reported in previous work). Three main aspects are described: resource compilation, engine adaptation and results.
Overview of TweetMT: A Shared Task on Machine Translation of Tweets at SEPLN 2015
This article presents an overview of the shared task that took place as part of the TweetMT workshop held at SEPLN 2015. The task consisted in translating collections of tweets from and to several languages. The article outlines the data collection and annotation process, the development and evaluation of the shared task, as well as the results achieved by the participants.
Dublin City University at the TweetMT 2015 Shared Task
We describe our participation in TweetMT for three language pairs in both directions: Spanish from/to Catalan, Basque and Portuguese. We used a range of techniques: statistical and rule-based MT, morph segmentation, data selection with ParFDA and system combination. As for resources, our focus was on crawling vast amounts of tweets to perform monolingual domain adaptation. Our system was the be...
An Analysis of Twitter Corpora and the Differences between Formal and Colloquial Tweets
This work reviews recent publications addressing the Twitter translation task, and highlights the lack of appropriate corpora that represent the colloquial language used on Twitter. It also discusses the most well-known issues in the Twitter genre: the use of hashtags and the number of OOVs, with a special focus on comparing the differences between formal and colloquial texts. Resumen: Este trab...
Language Segmentation of Twitter Tweets using Weakly Supervised Language Model Induction
This paper presents early results of a weakly supervised language model induction approach for language segmentation of multilingual texts with a special focus on short texts.